A Hybrid Word Alignment Approach to Improve Translation Lexicons with Compound Words and Idiomatic Expressions

نویسنده

  • Nasredine Semmar
چکیده

In this paper, we present a hybrid approach to align single words, compound words and idiomatic expressions from bilingual parallel corpora. The objective is to develop, improve and maintain automatically translation lexicons. This approach combines linguistic and statistical information in order to improve word alignment results. The linguistic improvements taken into account refer to the use of an existing bilingual lexicon, named entities recognition, grammatical tags matching and detection of syntactic dependency relations between words. Statistical information refer to the number of occurrences of repeated words, their positions in the parallel corpus and their lengths in terms of number of characters. Single-word alignment uses an existing bilingual lexicon, named entities and cognates detection and grammatical tags matching. Compound-word alignment consists in establishing correspondences between the compound words of the source sentence and the compound words of the target sentences. A syntactic analysis is applied on the source and target sentences in order to extract dependency relations between words and to recognize compound words. Idiomatic expressions alignment starts with a monolingual term extraction for each of the source and target languages, which provides a list of sequences of repeated words and a list of potential translations. These sequences are represented with vectors which indicate their numbers of occurrences and the numbers of segments in which they appear. Then, the translation relation between the source and target expressions are evaluated with a distance metric. The single and compound word aligners have been evaluated on a subset of 1103 sentences in English and French of the JOC (Official Journal of the European Community) corpus . The obtained results showed that these aligners generate a translation lexicon with 90 % of precision for single words and 84 % of precision for compound words. We evaluated the idiomatic expressions aligner on a subset of the Canadian Parliament Hansard corpus and we obtained a precision of 81%.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment

Bilingual lexicons of proper names play a vital role in machine translation and cross-language information retrieval. Word alignment approaches are generally used to construct bilingual lexicons automatically from parallel corpora. Aligning proper names is a task particularly difficult when the source and target languages of the parallel corpus do not share a same written script. We present in ...

متن کامل

(Un)Translatability of Persian Idiomatic Expressions to English in Political Discourse

The present study sought to investigate the extent to which Persian idiomatic expressions would influence the western translators' strategies in providing the ultimate product in English, and it also attempted to uncover the underlying assumptions in target text, then to suggest some weighty strategies to overcome difficulties with translation. For this purpose, the data was analyzed within the...

متن کامل

Identifying Idiomatic Expressions Using Automatic Word-Alignment

For NLP applications that require some sort of semantic interpretation it would be helpful to know what expressions exhibit an idiomatic meaning and what expressions exhibit a literal meaning. We investigate whether automatic word-alignment in existing parallel corpora facilitates the classification of candidate expressions along a continuum ranging from literal and transparent expressions to i...

متن کامل

Study of the impact of proper name transliteration on the performance of word alignment in French-Arabic parallel corpora (Etude de l'impact de la translittération de noms propres sur la qualité de l'alignement de mots à partir de corpus parallèles français-arabe) [in French]

Bilingual lexicons play a vital role in cross-language information retrieval and machine translation. The manual construction of these lexicons is often costly and time consuming. Word alignment techniques are generally used to construct bilingual lexicons from parallel texts. Aligning single words and nominal syntagms from parallel texts is relatively a well controlled task for languages using...

متن کامل

Extracting Translation Lexicons from Bilingual Corpora: Application to South-Slavonic Languages

The paper presents a novel approach for automatic translation lexicon extraction from a parallel sentence-aligned corpus. This is a five-step process, which includes cognate extraction, word alignment, phrase extraction, statistical phrase filtering, and linguistic phrase filtering. Unlike other approaches whose objective is to extract word or phrase pairs to be used in machine translation, we ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010